# MMMU Evaluation Guidelines
This document gives detailed instructions for evaluation.
To run our evaluation script, make sure your model outputs follow the same structure as ours.

## Transformers version
```
pip install transformers==4.43.0
cd hae_phi3.5
```
We provide two options:
1. Evaluation only: parse the responses yourself and provide a single file with all the final predictions.
2. Parse and evaluation: leave the response parsing to our scripts, using the output formats shown below.

## Evaluation Only
If you want to use your own parsing logic and *only provide the final answer*, you can use `main_eval_only.py`.

You can provide all the outputs in *one file* in the following format:

```
{
    "validation_Accounting_1": "D", # strictly "A", "B", "C", "D" for multi-choice question
    "validation_Architecture_and_Engineering_14": "0.0", # any string response for open question.
    ...
}
```
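Before running the evaluation, it can help to sanity-check the prediction file. The snippet below is a minimal sketch, not part of the provided scripts: `check_predictions` and `save_predictions` are hypothetical helper names, and it assumes you already know which question IDs are multi-choice. It also assumes the file is read as standard JSON, in which case the `#` annotations shown above are explanatory only and must not appear in the actual file.

```python
import json

VALID_CHOICES = {"A", "B", "C", "D"}


def check_predictions(predictions, multi_choice_ids):
    """Return a list of human-readable problems found in a predictions dict."""
    problems = []
    for qid, answer in predictions.items():
        if not isinstance(answer, str):
            problems.append(f"{qid}: answer must be a string, got {type(answer).__name__}")
        elif qid in multi_choice_ids and answer not in VALID_CHOICES:
            problems.append(f"{qid}: expected one of A/B/C/D, got {answer!r}")
    return problems


def save_predictions(predictions, multi_choice_ids, path):
    """Write the predictions as plain JSON, refusing to write if any check fails."""
    problems = check_predictions(predictions, multi_choice_ids)
    if problems:
        raise ValueError("invalid predictions:\n" + "\n".join(problems))
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=4)


predictions = {
    "validation_Accounting_1": "D",
    "validation_Architecture_and_Engineering_14": "0.0",
}
# Assumption: the set of multi-choice question IDs is known to the caller.
multi_choice_ids = {"validation_Accounting_1"}

print(check_predictions(predictions, multi_choice_ids))
```

Calling `save_predictions(predictions, multi_choice_ids, "total_val_output.json")` would then produce a file that can be passed to `--output_path`.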
Then run `main_eval_only.py` with:
```
python main_eval_only.py --output_path ./example_outputs/Phi3.5_eval/total_val_output.json
```

Please refer to the [example output](https://github.com/MMMU-Benchmark/MMMU/blob/main/eval/example_outputs/llava1.5_13b/total_val_output.json) for a complete example of the prediction file format.

## Run Phi3.5_Vision_Instruct
Alternatively, you can run `run_phi3.5_mmmu_inference.py` to generate the responses and get the answers for the MMMU tasks.

```
python run_phi3.5_mmmu_inference.py --path ./example_outputs/Phi3.5_eval --subject ALL  # all subjects

# Or specify a single subject with --subject instead of ALL

python run_phi3.5_mmmu_time.py  # measure the inference time
```

# Story Generation Evaluation Guidelines

```
python hae-story-generate.py  # generate the story summarization
```
Then combine all the samples into `output_final.jsonl`. The scripts for processing the results are in the `result` directory:

- result
  - `replace.py`: regular text replacement
  - `togatherhae.py`: combines all results
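For orientation, combining per-sample files into a single JSONL file can look like the sketch below. This is not the code in `togatherhae.py`: the function name `combine_jsonl` and the `samples/*.jsonl` layout are assumptions made for illustration, and the sketch assumes every sample file holds one JSON object per line.

```python
import glob
import json
import os


def combine_jsonl(pattern, out_path):
    """Concatenate every JSONL file matching `pattern` into `out_path`.

    Returns the number of records written.
    """
    count = 0
    out_abs = os.path.abspath(out_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):
            # Skip the output file itself if it happens to match the pattern.
            if os.path.abspath(path) == out_abs:
                continue
            with open(path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)  # fail early on malformed lines
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")
                    count += 1
    return count


# Hypothetical usage; adjust the pattern to wherever your samples are stored:
# combine_jsonl("samples/*.jsonl", "output_final.jsonl")
```

Parsing each line before writing it back costs a little time but surfaces truncated or malformed samples before they reach the scoring step.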

## Score eval

The DeepSeek evaluation scripts are in the `eval` directory:

- eval
  - `Deepseek_score_eval.py`: evaluates the scores; requires your API key
  - `Extract_score.py`: extracts the score from the generated text
  - `average_score.py`: computes the average score reported in the paper
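The extract-then-average step can be pictured with the sketch below. It is illustrative only and does not reproduce `Extract_score.py` or `average_score.py`: the `Score: <number>` response format, the regular expression, and the function names are all assumptions, so adapt them to the text the evaluator actually returns.

```python
import re
from statistics import mean

# Assumption: the evaluator's reply contains something like "Score: 8" or "score 6.5".
SCORE_PATTERN = re.compile(r"score\s*[:=]?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_score(text):
    """Return the first score found in `text` as a float, or None if absent."""
    match = SCORE_PATTERN.search(text)
    return float(match.group(1)) if match else None


def average_score(texts):
    """Average the scores that could be extracted; None if there were none."""
    scores = [s for s in map(extract_score, texts) if s is not None]
    return mean(scores) if scores else None
```

Responses without a recognizable score are skipped rather than counted as zero, so it is worth logging how many were dropped before reporting an average.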

